Hi, I’m Ying! Do you love chocolate? I will not be surprised if you say “YES”! Chocolate is fantastic. It tastes wonderful and makes people feel happy. According to Healthline, studies show that dark chocolate can improve health and lower the risk of heart disease. So as Christmas is around the corner, let’s look at chocolate and uncover the mystery behind this fantastic sweet!
I focus on chocolate bars and use the Chocolate Bar Ratings dataset from https://www.kaggle.com/rtatman/chocolate-bar-ratings.
The dataset is introduced as follows: Each year, residents of the United States collectively eat more than 2.8 billion pounds of chocolate. However, not all chocolate bars are created equal! This dataset contains expert ratings of over 1,700 individual chocolate bars, along with information on their regional origin, percentage of cocoa, the variety of chocolate beans used, and where the beans were grown.
About the ratings: A rating here only represents an experience with one bar from one batch and reflects the overall experience of flavor, texture, and the after melt of the chocolate.
Using this dataset, I invite you to learn more about charming chocolate!
In the section below, I will explain how I proceeded with this project and how I did the data cleaning.
1. Data Sources
To answer my questions, I import the Chocolate Bar Ratings data (https://www.kaggle.com/rtatman/chocolate-bar-ratings). All the data is downloaded directly from the website, with no pre-cleaning. Everything was done in R, as you can see in the later steps.
2. Data Cleaning
To do the analysis, I first load the libraries and then check the data.
#require(tm)
require(tidyverse)
## Loading required package: tidyverse
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.4 ✓ dplyr 1.0.7
## ✓ tidyr 1.1.3 ✓ stringr 1.4.0
## ✓ readr 2.0.1 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
require(lubridate)
## Loading required package: lubridate
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
require(skimr)
## Loading required package: skimr
require(kableExtra)
## Loading required package: kableExtra
##
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
##
## group_rows
require(ggplot2)
require(RColorBrewer)
## Loading required package: RColorBrewer
library(here)
## here() starts at /Users/cassie/Documents/UT学习/21FA/R/Week14/Final
library(tidyverse)
library(anytime)
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
library(grid)
library(wordcloud)
library(corrplot)
## corrplot 0.91 loaded
#library(tm)
There are lots of interesting things to find in this dataset.
My personal interests are…
Which companies produce good chocolate (I want to buy some for Christmas!)?
What is the average rating of chocolate?
Where do good chocolates come from?
What is the relationship between cocoa percentage and the quality of chocolate?
Can we predict the rating of a chocolate?
Import the dataset and see its structure.
ChocolateData <- read.csv("../data/flavors_of_cacao.csv")
str(ChocolateData)
## 'data.frame': 1795 obs. of 9 variables:
## $ Company...Maker.if.known. : chr "A. Morin" "A. Morin" "A. Morin" "A. Morin" ...
## $ Specific.Bean.Origin.or.Bar.Name: chr "Agua Grande" "Kpime" "Atsane" "Akata" ...
## $ REF : int 1876 1676 1676 1680 1704 1315 1315 1315 1319 1319 ...
## $ Review.Date : int 2016 2015 2015 2015 2015 2014 2014 2014 2014 2014 ...
## $ Cocoa.Percent : chr "63%" "70%" "70%" "70%" ...
## $ Company.Location : chr "France" "France" "France" "France" ...
## $ Rating : num 3.75 2.75 3 3.5 3.5 2.75 3.5 3.5 3.75 4 ...
## $ Bean.Type : chr " " " " " " " " ...
## $ Broad.Bean.Origin : chr "Sao Tome" "Togo" "Togo" "Togo" ...
summary(ChocolateData)
## Company...Maker.if.known. Specific.Bean.Origin.or.Bar.Name REF
## Length:1795 Length:1795 Min. : 5
## Class :character Class :character 1st Qu.: 576
## Mode :character Mode :character Median :1069
## Mean :1036
## 3rd Qu.:1502
## Max. :1952
## Review.Date Cocoa.Percent Company.Location Rating
## Min. :2006 Length:1795 Length:1795 Min. :1.000
## 1st Qu.:2010 Class :character Class :character 1st Qu.:2.875
## Median :2013 Mode :character Mode :character Median :3.250
## Mean :2012 Mean :3.186
## 3rd Qu.:2015 3rd Qu.:3.500
## Max. :2017 Max. :5.000
## Bean.Type Broad.Bean.Origin
## Length:1795 Length:1795
## Class :character Class :character
## Mode :character Mode :character
##
##
##
head(ChocolateData, 10)
## Company...Maker.if.known. Specific.Bean.Origin.or.Bar.Name REF Review.Date
## 1 A. Morin Agua Grande 1876 2016
## 2 A. Morin Kpime 1676 2015
## 3 A. Morin Atsane 1676 2015
## 4 A. Morin Akata 1680 2015
## 5 A. Morin Quilla 1704 2015
## 6 A. Morin Carenero 1315 2014
## 7 A. Morin Cuba 1315 2014
## 8 A. Morin Sur del Lago 1315 2014
## 9 A. Morin Puerto Cabello 1319 2014
## 10 A. Morin Pablino 1319 2014
## Cocoa.Percent Company.Location Rating Bean.Type Broad.Bean.Origin
## 1 63% France 3.75 Sao Tome
## 2 70% France 2.75 Togo
## 3 70% France 3.00 Togo
## 4 70% France 3.50 Togo
## 5 70% France 3.50 Peru
## 6 70% France 2.75 Criollo Venezuela
## 7 70% France 3.50 Cuba
## 8 70% France 3.50 Criollo Venezuela
## 9 70% France 3.75 Criollo Venezuela
## 10 70% France 4.00 Peru
Now we have the data frame, with 1795 observations and 9 variables, but I don’t like how it looks.
I want to…
Change the column names to be more readable
Convert columns to the proper data types
Deal with missing values
Delete “REF”, since I will not use this variable
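# Shorter, more readable column names
colnames(ChocolateData) <- c("Company", "BarOrigin", "REF", "ReviewDate", "CocoaPct", "Loc", "Rating", "Type", "BeanOrigin")
# Strip the percent sign and convert cocoa percentage to a number
ChocolateData$CocoaPct <- gsub("[%]", "", ChocolateData$CocoaPct)
ChocolateData$CocoaPct <- as.numeric(ChocolateData$CocoaPct)
# Trim whitespace from bean type and bean origin (columns 8 and 9)
ChocolateData[, c(8,9)] <- sapply(ChocolateData[,c(8,9)], str_trim)
# Treat empty strings as missing values
is.na(ChocolateData) <- ChocolateData==''
# Drop the unused REF column (column 3)
ChocolateData <- ChocolateData[, -3]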
str(ChocolateData)
## 'data.frame': 1795 obs. of 8 variables:
## $ Company : chr "A. Morin" "A. Morin" "A. Morin" "A. Morin" ...
## $ BarOrigin : chr "Agua Grande" "Kpime" "Atsane" "Akata" ...
## $ ReviewDate: int 2016 2015 2015 2015 2015 2014 2014 2014 2014 2014 ...
## $ CocoaPct : num 63 70 70 70 70 70 70 70 70 70 ...
## $ Loc : chr "France" "France" "France" "France" ...
## $ Rating : num 3.75 2.75 3 3.5 3.5 2.75 3.5 3.5 3.75 4 ...
## $ Type : chr NA NA NA NA ...
## $ BeanOrigin: chr "Sao Tome" "Togo" "Togo" "Togo" ...
head(ChocolateData, 10)
## Company BarOrigin ReviewDate CocoaPct Loc Rating Type BeanOrigin
## 1 A. Morin Agua Grande 2016 63 France 3.75 <NA> Sao Tome
## 2 A. Morin Kpime 2015 70 France 2.75 <NA> Togo
## 3 A. Morin Atsane 2015 70 France 3.00 <NA> Togo
## 4 A. Morin Akata 2015 70 France 3.50 <NA> Togo
## 5 A. Morin Quilla 2015 70 France 3.50 <NA> Peru
## 6 A. Morin Carenero 2014 70 France 2.75 Criollo Venezuela
## 7 A. Morin Cuba 2014 70 France 3.50 <NA> Cuba
## 8 A. Morin Sur del Lago 2014 70 France 3.50 Criollo Venezuela
## 9 A. Morin Puerto Cabello 2014 70 France 3.75 Criollo Venezuela
## 10 A. Morin Pablino 2014 70 France 4.00 <NA> Peru
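To see how much is actually missing after this step, a per-column count of `NA` values can help. The sketch below uses a small made-up data frame (`toy` and its values are hypothetical, not taken from the dataset); the same `colSums(is.na(...))` call could be applied to `ChocolateData`.

```r
# Hypothetical toy data mimicking some of the cleaned columns (values are made up)
toy <- data.frame(
  Company    = c("A", "B", "C", "D"),
  Type       = c(NA, "Criollo", NA, "Trinitario"),
  BeanOrigin = c("Peru", NA, "Togo", "Cuba"),
  stringsAsFactors = FALSE
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))

# Share of missing values in each column, as a percentage
na_pct <- round(100 * na_counts / nrow(toy), 1)

print(data.frame(missing = na_counts, percent = na_pct))
```

A column with a large share of missing values, as bean type appears to have in the first rows above, is worth treating with care in any later analysis.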
Cool, I like how it looks now.
Now we have 8 variables. Let’s look at distributions of these variables. I will plot the categorical variables as bar charts, showing the most popular values and see what we could find.
#top companies
Top_companies <- ChocolateData %>%
group_by(Company) %>%
summarise(Count= n())%>%
top_n(10, wt = Count) %>%
arrange(desc(Count))
ggplot(Top_companies, aes(reorder(Company, Count), Count, fill = Count)) +
coord_flip() +
geom_bar(stat = "identity", size = 0.1)+xlab("Top_Companies")
#Review_cocoa_date
Review_cocoa <- ChocolateData %>%
group_by(ReviewDate) %>%
summarise(Count= n())
ggplot(Review_cocoa, aes(x =factor(ReviewDate), y = Count, fill = Count)) +
geom_bar(stat = "identity", size = 0.1) +
xlab("Review Date") +
coord_flip()
#BarOrigin
BarOrigin_new <- ChocolateData %>%
group_by(BarOrigin) %>%
summarise(Count= n()) %>%
top_n(10, wt = Count) %>%
arrange(desc(Count))
ggplot(BarOrigin_new, aes(reorder(BarOrigin, Count), Count, fill = Count)) +
coord_flip() +
geom_bar(stat = "identity", size = 0.1) + xlab("Top_BarOrigin")
#BeanTypes
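BeanTypes <- ChocolateData %>%
filter(!is.na(Type)) %>% # drop only rows with a missing bean type (na.omit() would also drop rows missing other columns)
group_by(Type) %>%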
summarise(Count= n()) %>%
mutate(pct=Count/sum(Count)) %>%
top_n(10, wt = pct)
ggplot(BeanTypes, aes(x =reorder(Type,pct), y =pct, fill = pct)) +
geom_bar(stat = "identity", size = 0.1) +
coord_flip() +
xlab("Bean_Type") +ylab("Percentage")
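The count-and-keep-the-top-10 steps above are nearly identical for each variable, so they could be wrapped in one small function. This is only a sketch: `top_counts` is a hypothetical helper name, the `toy` data is made up, and it uses `dplyr::slice_max()`, which the dplyr documentation recommends in place of the superseded `top_n()`.

```r
library(dplyr)

# Hypothetical helper: count the values of one column and keep the n most frequent
top_counts <- function(data, col, n = 10) {
  data %>%
    count({{ col }}, name = "Count") %>%  # one row per distinct value
    slice_max(Count, n = n) %>%           # keep the n largest counts (ties are kept)
    arrange(desc(Count))
}

# Made-up example data
toy <- data.frame(Company = c("A", "A", "A", "B", "B", "C"))
top2 <- top_counts(toy, Company, n = 2)
print(top2)
```

With the real data, `top_counts(ChocolateData, Company)` and `top_counts(ChocolateData, BarOrigin)` would be the equivalent calls.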
So we can find…
After finishing these charts, I have a basic understanding of the data. Now I will turn to my questions and try to find answers.
For my first question: Which companies produce good chocolate (I want to buy some for Christmas!)?
Company_rating <- ChocolateData %>%
group_by(Company) %>%
summarize(rating = mean(Rating), count = n()) %>%
arrange(desc(count), desc(rating))
head(Company_rating, n = 10)
## # A tibble: 10 × 3
## Company rating count
## <chr> <dbl> <int>
## 1 Soma 3.59 47
## 2 Bonnat 3.44 27
## 3 Fresco 3.38 26
## 4 Pralus 3.28 25
## 5 A. Morin 3.38 23
## 6 Arete 3.53 22
## 7 Domori 3.48 22
## 8 Guittard 3.17 22
## 9 Valrhona 3.33 21
## 10 Hotel Chocolat (Coppeneur) 2.97 19
Companies <- ChocolateData %>%
group_by(Company) %>%
filter(n() > 10) %>%
mutate(avg = mean(Rating))
Companies %>%
ggplot(aes(x = reorder(as.factor(Company), Rating, FUN = mean), y = Rating)) +
geom_point(aes(x = as.factor(Company), y = avg, colour = avg)) +
geom_count(alpha = .1) +
coord_flip() +
labs(x = "Company", y = "Rating")
Taking both the number of reviews and the average rating into consideration, Soma is No. 1. Amedei is excellent, but its sample size is rather small. Therefore, I think Soma is where high-quality chocolate is easiest to find. Let’s buy Soma!
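Since the table above is sorted by review count first, another way to combine the two criteria is to require a minimum number of reviews and then rank by average rating. The sketch below shows the idea on made-up data; `toy`, `min_reviews`, and the threshold of 3 are assumptions for illustration only.

```r
library(dplyr)

# Made-up ratings for three hypothetical companies
toy <- data.frame(
  Company = c("A", "A", "A", "B", "B", "B", "C"),
  Rating  = c(3.5, 3.75, 3.25, 3.0, 3.25, 2.75, 4.0)
)

min_reviews <- 3  # assumed threshold; the plot above keeps companies with more than 10 reviews

ranked <- toy %>%
  group_by(Company) %>%
  summarise(rating = mean(Rating), count = n()) %>%
  filter(count >= min_reviews) %>%  # drop companies with too few reviews
  arrange(desc(rating))             # best average first

print(ranked)
```

Here company C has the highest single rating but is dropped because one review is too little evidence, which is the same reasoning applied above to companies with small samples.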
After finding the best company, I want to understand how chocolate ratings are distributed overall. So I draw a bar chart of the distribution.
ggplot(ChocolateData, aes(factor(Rating))) +
geom_bar(fill = "steelblue") +
xlab("Rating")
summary(ChocolateData$Rating)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.875 3.250 3.186 3.500 5.000
Flavors of Cacao Rating System:
5 = Elite (Transcending beyond the ordinary limits)
4 = Premium (Superior flavor development, character and style)
3 = Satisfactory (3.0) to praiseworthy (3.75) (well made with special qualities)
2 = Disappointing (Passable but contains at least one significant flaw)
1 = Unpleasant (mostly unpalatable)
Most of the ratings are between 2.75 and 3.75, and I’m happy to see that many of them lie around 3.5, which is Satisfactory! This makes me feel more confident about buying chocolate at random from any company without worrying too much about its flavor. Also, the mean rating is 3.186 and the median is 3.25, which answers the question of what the average rating of chocolate is.
However, the numbers alone are not easy to interpret, so I want to make them more readable by adding notes. First, I arrange the ratings into 5 groups.
Rating_Pct_Com <- data.frame(RatingLev = c("Unpleasant","Disappointing","Satisfactory-Praiseworthy","Premium","Elite"),
Rating = c("1 <= Rating < 2", "2 <= Rating < 3", "3 <= Rating <= 3.75", "3.75 < Rating < 5", "Rating = 5"),
# Notes are listed in the same order as RatingLev
Note = c("Unpleasant (mostly unpalatable)",
"Disappointing (Passable but contains at least one significant flaw)",
"Satisfactory(3.0) to praiseworthy(3.75) (well made with special qualities)",
"Premium (Superior flavor development, character and style)",
"Elite (Transcending beyond the ordinary limits)"))
kbl(Rating_Pct_Com, caption = "Chocolate Rating Description") %>%
kable_classic(html_font = "Cambria", full_width=FALSE)
| RatingLev | Rating | Note |
|---|---|---|
| Unpleasant | 1 <= Rating < 2 | Unpleasant (mostly unpalatable) |
| Disappointing | 2 <= Rating < 3 | Disappointing (Passable but contains at least one significant flaw) |
| Satisfactory-Praiseworthy | 3 <= Rating <= 3.75 | Satisfactory(3.0) to praiseworthy(3.75) (well made with special qualities) |
| Premium | 3.75 < Rating < 5 | Premium (Superior flavor development, character and style) |
| Elite | Rating = 5 | Elite (Transcending beyond the ordinary limits) |
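The table describes the five levels but does not attach a level to each bar. A minimal sketch of that step, assuming the whole-number boundaries of the Flavors of Cacao scale, uses base R `cut()`; `rating_level` is a hypothetical helper name and the example ratings are made up.

```r
# Rating levels from the Flavors of Cacao scale, lowest to highest
levels_5 <- c("Unpleasant", "Disappointing", "Satisfactory-Praiseworthy",
              "Premium", "Elite")

# Assign each rating to a level; intervals are closed on the left:
# [1,2), [2,3), [3,4), [4,5), and 5 itself
rating_level <- function(rating) {
  cut(rating,
      breaks = c(1, 2, 3, 4, 5, Inf),
      labels = levels_5,
      right  = FALSE)
}

# Made-up example ratings, one per level
example <- c(1, 2.75, 3.75, 4, 5)
print(data.frame(Rating = example, RatingLev = rating_level(example)))
```

With the real data, `ChocolateData$RatingLev <- rating_level(ChocolateData$Rating)` would add the level as a new column, and `table(ChocolateData$RatingLev)` would count the bars in each group.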
Where do good chocolates come from?
Next, I want to find out where good chocolate comes from. First, I create a word cloud of the company locations, because I always like word clouds and they are really cool! Then, I draw a boxplot to see the relationship between company location and rating.
word_choc <- gsub(" ", "",ChocolateData$Loc)
wordcloud(word_choc, max.words = 200, random.order = FALSE, scale = c(4,0.7), rot.per = 0.5, colors = brewer.pal(8, "Dark2"))
## Loading required namespace: tm
## Warning in tm_map.SimpleCorpus(corpus, tm::removePunctuation): transformation
## drops documents
## Warning in tm_map.SimpleCorpus(corpus, function(x) tm::removeWords(x,
## tm::stopwords())): transformation drops documents
This is cool: we can see the top countries in a visual way. Do these countries have the best chocolates? I want to create a boxplot to see the distribution of ratings by country. I decide to display only those countries that have been rated more than 5 times.
ChocolateData %>%
group_by(Loc) %>%
filter(n() > 5) %>%
mutate(avg = mean(Rating)) %>%
ggplot() +
geom_boxplot(aes(reorder(Loc, avg), Rating, fill = avg)) +
scale_fill_continuous(low = "#132B43", high = "#56B1F7", name = "Average rating") +
coord_flip() +
labs(x = "Company Location", y = "Rating")
summary(ChocolateData$Rating)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.875 3.250 3.186 3.500 5.000
So we can tell that good chocolate comes from Australia, Switzerland, Italy, and Canada.
Cocoa percent
Let’s investigate whether there is a relationship between cocoa percentage and the chocolate’s rating. My assumption: the higher the cocoa percentage, the more bitter the chocolate tastes.
ChocolateData%>%
ggplot(aes(x = Rating, y = CocoaPct)) +
geom_point() +
labs(x = "Rating", y ="Cocoa percentage" ) +
geom_smooth(method = "lm", se = FALSE, col = "brown")
## `geom_smooth()` using formula 'y ~ x'
From the chart, I can see that there is no strong linear relationship between cocoa percentage and chocolate rating. So I try to go deeper.
model_1 <- lm(formula = Rating ~ CocoaPct, data = ChocolateData)
summary(model_1)
##
## Call:
## lm(formula = Rating ~ CocoaPct, data = ChocolateData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.2071 -0.3196 0.0429 0.3178 1.7929
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.079388 0.126757 32.183 < 2e-16 ***
## CocoaPct -0.012461 0.001761 -7.076 2.12e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4717 on 1793 degrees of freedom
## Multiple R-squared: 0.02717, Adjusted R-squared: 0.02662
## F-statistic: 50.07 on 1 and 1793 DF, p-value: 2.122e-12
As we can see, the adjusted R-squared is only 0.02662, so cocoa percentage explains less than 3% of the variation in rating, even though the p-value of 2.122e-12 shows that the slope is statistically significant. A simple linear model on cocoa percentage alone is therefore not a good model for predicting chocolate rating. Still, the negative slope, as well as the chart, implies that the higher the cocoa percentage, the lower the rating tends to be, which is the opposite of my assumption.
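To make the size of the effect concrete, the fitted line can be evaluated by hand. The sketch below copies the coefficients and R-squared from the `summary(model_1)` output above; `predict_rating` is a hypothetical helper name, and 70 and 85 are just example cocoa percentages.

```r
# Coefficients copied from the summary(model_1) output above
intercept <- 4.079388
slope     <- -0.012461

# Predicted rating for a given cocoa percentage under the fitted line
predict_rating <- function(cocoa_pct) intercept + slope * cocoa_pct

pred_70 <- predict_rating(70)
pred_85 <- predict_rating(85)

# In simple linear regression, r has the sign of the slope and r^2 equals R-squared
r_squared <- 0.02717
r <- sign(slope) * sqrt(r_squared)

print(round(c(pred_70 = pred_70, pred_85 = pred_85, r = r), 3))
```

Under this model a bar with 85% cocoa is predicted to score only about 0.19 points lower than a 70% bar (roughly 3.02 versus 3.21), which is small compared with the residual standard error of 0.4717.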
If we want to find a model to help with predicting the rating of the chocolate, we need to go back to the main dataset and add more variables into consideration.
chocolate_cor <- data.frame(ChocolateData$ReviewDate, ChocolateData$CocoaPct, ChocolateData$Rating)
names(chocolate_cor)[1:3] <- c("Review Date", "Cocoa Percent", "Rating")
chocolate_cor <- round(cor(chocolate_cor), 3)
chocolate_cor %>%
kbl(caption = "The relationship between Rating, Review date and Cocoa Percent", digits= 3) %>%
kable_classic(html_font = "Cambria", full_width=FALSE)
| | Review Date | Cocoa Percent | Rating |
|---|---|---|---|
| Review Date | 1.000 | 0.038 | 0.100 |
| Cocoa Percent | 0.038 | 1.000 | -0.165 |
| Rating | 0.100 | -0.165 | 1.000 |
Relations <- corrplot(chocolate_cor, method = 'circle', type = 'upper', tl.srt = 30)
We can see from the chart that there are no strong correlations between these variables: there is a slightly negative correlation between rating and cocoa percentage (-0.165) and a very slightly positive correlation between review date and rating (0.100).
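Taking more variables into consideration could look like the sketch below, which fits one linear model on two numeric predictors. The data here is simulated purely so that the example runs on its own; `toy`, `model_2`, and the numbers used to generate the ratings are assumptions, not results from the real dataset.

```r
set.seed(1)  # make the simulated example reproducible

# Simulated stand-in for the cleaned data (NOT the real ratings)
n <- 200
toy <- data.frame(
  CocoaPct   = sample(seq(55, 100, by = 5), n, replace = TRUE),
  ReviewDate = sample(2006:2017, n, replace = TRUE)
)
toy$Rating <- 4 - 0.012 * toy$CocoaPct + rnorm(n, sd = 0.45)

# Rating explained by two numeric predictors at once
model_2 <- lm(Rating ~ CocoaPct + ReviewDate, data = toy)

# Adjusted R-squared: share of variance explained, penalised for model size
adj_r2 <- summary(model_2)$adj.r.squared

print(coef(model_2))
print(adj_r2)
```

On the real data, the equivalent call would be `lm(Rating ~ CocoaPct + ReviewDate, data = ChocolateData)`, and comparing its adjusted R-squared with that of `model_1` would show whether the extra variable helps.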
Buy Soma!
Pretty good! Most of the ratings are between 2.75 and 3.75; the mean rating is 3.186 and the median is 3.25.
In general, Canada and Italy are better choices. Even though the USA ranks first in the number of companies, it does not produce the highest-rated chocolate.
It is hard to say that there is a strong relationship between cocoa percentage and rating; if anything, it is slightly negative. I guess this is because the higher the cocoa percentage, the more bitter the chocolate, and most people like a sweet taste. Choosing a cocoa percentage of 70%-75% is most likely to get you awesome chocolate.
From my analysis, it is hard to say that any single factor helps with predicting the rating of chocolate. You are likely to get higher-rated chocolate if you choose a cocoa percentage of 70%-75%. It would be even better if the company were from Australia, Switzerland, Italy, or Canada. To make things easier, buy Soma.
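One way to check the 70%-75% suggestion would be to group bars into cocoa bands and compare the average rating per band. The sketch below shows the mechanics only: `toy` is made-up data, the band boundaries are assumptions, and its numbers say nothing about the real dataset.

```r
library(dplyr)

# Made-up bars (illustrative values only, not from the dataset)
toy <- data.frame(
  CocoaPct = c(60, 65, 70, 72, 75, 80, 85, 100),
  Rating   = c(3.0, 3.25, 3.5, 3.75, 3.5, 3.0, 2.75, 2.0)
)

by_band <- toy %>%
  mutate(Band = case_when(
    CocoaPct < 70  ~ "below 70%",
    CocoaPct <= 75 ~ "70%-75%",   # 70 and 75 are both included in this band
    TRUE           ~ "above 75%"
  )) %>%
  group_by(Band) %>%
  summarise(avg_rating = mean(Rating), count = n())

print(by_band)
```

Replacing `toy` with `ChocolateData` would give the real averages, and the `count` column shows how many bars each average rests on.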
Furthermore, I think the bean type and bar origin may play an important role in chocolate ratings. I will go deeper and use these variables to look for relationships if I have a chance.
Thanks for going through the chocolate review journey with me! Nevertheless, there are limitations to this simple research. I did not go very deep and only used some common variables to describe the relationships. The latest data is from 2017, which is about four years ago. How is the situation now? Will Covid influence the transport of cocoa, and therefore the production of chocolate? I have no answers to these questions. I hope I will be able to answer them someday by continuing my study.
1. Photo, https://www.pinterest.com/pin/586734657714494204
2. Healthline, “7 Proven Health Benefits of Dark Chocolate”, https://www.healthline.com/nutrition/7-health-benefits-dark-chocolate?epik=dj0yJnU9bXYyeWVSd3ZMb0dFQjg4ZXg3ZGl0QjRWdU9YLWM5ZWkmcD0wJm49NEpQM3lVWmNtQ2kzY1lMNDdfYkUwQSZ0PUFBQUFBR0d0amU4#TOC_TITLE_HDR_9
3. Jason Horn, “What Is the Difference Between Bittersweet and Semisweet Chocolate?”, https://greatist.com/eat/what-is-the-difference-between-bittersweet-chocolate-and-semisweet-chocolate
4. Coffee Quality database from CQI, https://www.kaggle.com/volpatto/coffee-quality-database-from-cqi
5. Chocolate Bar Ratings, https://www.kaggle.com/rtatman/chocolate-bar-ratings
Note: I got a lot of inspiration from No. 4 and No. 5. After I finished the first version of my code, I made some adjustments based on their code and logic, which I truly appreciate, but all the code in my final version was written by me.